Enhancing genomic prediction with Stacking Ensemble Learning in Arabica Coffee

Coffee Breeding programs have traditionally relied on observing plant characteristics over years, a slow and costly process. Genomic selection (GS) offers a DNA-based alternative for faster selection of superior cultivars. Stacking Ensemble Learning (SEL) combines multiple models for potentially even more accurate selection. This study explores SEL potential in coffee breeding, aiming to improve prediction accuracy for important traits [yield (YL), total number of the fruits (NF), leaf miner infestation (LM), and cercosporiosis incidence (Cer)] in Coffea Arabica. We analyzed data from 195 individuals genotyped for 21,211 single-nucleotide polymorphism (SNP) markers. To comprehensively assess model performance, we employed a cross-validation (CV) scheme. Genomic Best Linear Unbiased Prediction (GBLUP), multivariate adaptive regression splines (MARS), Quantile Random Forest (QRF), and Random Forest (RF) served as base learners. For the meta-learner within the SEL framework, various options were explored, including Ridge Regression, RF, GBLUP, and Single Average. The SEL method was able to predict the predictive ability (PA) of important traits in Coffea Arabica. SEL presented higher PA compared with those obtained for all base learner methods. The gains in PA in relation to GBLUP were 87.44% (the ratio between the PA obtained from best Stacking model and the GBLUP), 37.83%, 199.82%, and 14.59% for YL, NF, LM and Cer, respectively. Overall, SEL presents a promising approach for GS. By combining predictions from multiple models, SEL can potentially enhance the PA of GS for complex traits.


Introduction
Coffee is one of the most globally beverages presenting importance in terms of its potential health, socioeconomic, and economic effects (Porto et al., 2019).These effects drive breeding programs worldwide to develop high-yielding, adaptable cultivars delivering superior bean quality (Barbosa et al., 2019).However, traditional selection methods based on phenotypic observations of the plants or their family history (pedigree) are expensive and time consuming, especially for perennial crops as coffee.
An alternative approach denoted genomic selection (GS) has been used as a successful tool in genetic improvement (Meuwissen et al., 2001).GS helps increase genetic gain per generation by allowing for earlier selection through improved prediction of the potential of individual plants (Daetwyler et al., 2013;Nascimento et al., 2019;Voss-Fels et al., 2019).In the field of coffee breeding, GS has been utilized with the dual aim of accelerating genetic gain through early selection and improving prediction accuracy (Sousa et al., 2019;Alkimim et al., 2020;Sousa et al., 2021;Coelho de Sousa et al., 2022).
Among several prediction models, Genomic Best Linear Unbiased Prediction (GBLUP) is the most widely used approach for genomic prediction due to its advantages (Zhang et al., 2021).Compared to other parametric methods, GBLUP allows to accurately estimate narrow-sense heritability (Li et al., 2019) and presents higher computational efficiency (Hernandez et al., 2020).GBLUP modeling is also flexible.It can be modified to incorporate additional genetic information beyond the typical single-nucleotide polymorphism (SNP) markers.Specifically, this modeling allows to account for non-additive genetic effects, environmental factors, and even genotype-by-environment interactions, enriching the analysis and potentially improving prediction accuracy (Jarquıń et al., 2014).
In the Artificial Intelligence Era, the interest in semi-and nonparametric methods for GS is increasing (Larkin et al., 2019;Coelho de Sousa et al., 2022;Seyum et al., 2022).These approaches, such as Artificial Neural Networks and Decision Trees, do not require prior assumptions about the relationships between inputs (SNP markers) and the output (phenotypic observations), allowing great flexibility to handle complex non-additive effects, such as dominance and epistasis (McKinney et al., 2006;Abdollahi-Arpanahi et al., 2020;Coelho de Sousa et al., 2022).In general, despite their potential, these approaches do not outperform the traditional parametric methods (e.g., GBLUP, Bayesian Alphabet - Gianola et al., 2009) used to predict the genetic merit of individuals (Liang et al., 2021).
Aiming to enhance predictive ability (PA), Ensemble Learning (EL) combines predictions from multiple models (base learners) into a single prediction (meta-learner) (Mendes-Moreira et al., 2012;Ganaie et al., 2022;Mienye and Sun, 2022).This approach leverages the strengths of diverse models to potentially generate more robust results compared to relying on a single learner (Liang et al., 2021;Kalule et al., 2023).In the context of GS, EL has found application through methods such as Random Forest (RF) and Bagging (Bag) (Xu et al., 2019;Abdollahi-Arpanahi et al., 2020;Sousa et al., 2021;Costa et al., 2022).These methods, categorized as Homogenous Learning (HL), utilize a single framework to produce a single prediction value.Conversely, the Stacking Ensembles Learning (SEL) approach combines predictions from diverse methods, potentially outperforming HL (Mendes-Moreira et al., 2012).SEL has seen success in GS, improving PA in Chinese Simmental cattle, Dutch cattle, and pine (Liang et al., 2021), achieving higher accuracy than GBLUP for most evaluated traits.
Despite being interesting, EL approach arises some issues that needs to be considered.First, since the same individuals are used to fitting the model(s) in the EL approach, it is expected the existence of correlation between the predictions derived from the different methods.This well-known statistical problem is referred as multicollinearity (Montgomery et al., 2021) and causes high variability of the estimated effects.The second issue is related to which dataset should be used to fit the meta-learner.One option is to use directly the predicted values derived from the base learners.In this case, the simple mean or some regression model that accounts for multicollinearity problem (e.g., Ridge Regression) can be implemented to makeup a single prediction.An alternative option also could consider combining the predicted values with the genomic covariates (i.e., SNP markers|predicted values from the base learners) with the previous training data as new inputs.In this regard, in addition to the multicollinearity, the course of the dimensionality is another issue to consider mainly because these are covariables of different type.Liang et al. (2021) used the predicted values derived from a multiple regression model as meta-learner.These authors obtained good results to improve PA compared to the conventional genomic prediction models on three different datasets.However, the use of an expanded training data augmented by SNP markers could be beneficial to further enhance the PA, and it emerges as an interesting approach.In this case, a model that addresses both multicollinearity and dimensionality problems should be used.One of the possible solutions can be considered to use a two-kernel GBLUP model as the meta-learner model.Another approach to evaluate is to consider only the predicted values provided by the best base learner models.
To date, no research has applied SEL to improve the prediction accuracy of important traits in coffee cultivars.This approach presents potential for coffee breeding, as it has been shown to outperform standard methods in other applications (Liang et al., 2021;Mohammed and Kora, 2023).By combining the strengths of multiple prediction models, SEL could lead to more reliable and accurate identification of valuable genetic traits in coffee plants, accelerating the development of superior coffee varieties.
In light of the mentioned points, the objective of this study was to use and evaluate the SEL to improve PA of important traits in Coffea Arabica.For that, the GBLUP, multivariate adaptive regression splines (MARS), Quantile Random Forest (QRF), and RF models were used as the base learner.Several approaches were considered as the meta-learner to construct the SEL framework.Specifically, the expanded-and non-expanded datasets were used for training.In addition, models that account for multicollinearity (Ridge Regression) and multicollinearity and dimensionality jointly (GBLUP) were also implemented.Nascimento et al. 10.3389/fpls.2024.1373318Frontiers in Plant Science frontiersin.org 2 Materials and methods

Phenotypic and genotypic data
The data were collected from the C. arabica breeding program, which is a joint partnership among the Agricultural Research Company of Minas Gerais (EPAMIG), the Federal University of Vicosa (UFV), and the Brazilian Agricultural Research Corporation (EMBRAPA).An experimental area is maintained at the Department of Phytopathology-UFV (20°44′25" S, 42°50′52" W).The database is composed of 13 progenies derived from crosses between three parents of the Catuaı́cultivar and three parents of the Hıbrido de Timor (HdT).Fifteen genotypes of the abovementioned progeny set (totaling 195 individuals) were genotyped for 21,211 SNP markers by Rapid Genomics, located in Gainesville Florida, USA.Information about the probes design and SNP identification are detailed in Sousa et al. (2017).The SNP markers set are widely distributed in the genome and in all coffee chromosomes, being useful for accurate studies on diversity and population structure, as well as selection and genomic association in C. arabica (Sousa et al., 2017, Sousa et al., 2019).The SNP quality control was carried out considering genotypic call rate and minor allele frequency equal to or greater than 90% and smaller than 5%, respectively.In this study a pre-selected set of 5,970 markers that did not reduce the PA of Arabica Coffee compared to the original set SNP markers in a previous study was used (Arcanjo et al., 2024).
The genotypes were planted on February 11, 2011, using a spacing of 3.0 m between rows and 0.7 m between plants following an augmented (check varieties) blocks experimental design.Nutritional management was carried out following the requirements of the crop.The phenotypic evaluations were performed in 2014, 2015, and 2016.A total of four traits were scored, two associated with the productivity, yield (YL-liters of fresh cherries harvested per plant) and total number of fruits (NF) -and two more associated with disease resistance-leaf miner infestation (LM) and cercosporiosis incidence (Cer) in Coffea Arabica.The incidence of cercosporiosis and leaf miner was evaluated using a score scale ranging from 1 to 5, in which 1 corresponded to genotypes without symptoms and 5 referred to highly susceptible genotypes.A comprehensive description of how the evaluations of each trait were performed can be found in Sousa et al. (2019).

Phenotypic data analysis
The phenotypic data for YL, NF, LM, and Cer were analyzed according to the following statistical model where y represents the vector of observed phenotypes; u is the vector referring to the general mean in each evaluation year; g is the vector of genetic random effects corresponding to the progeny such that g ∼ N(0, Is 2 g ); p is the random permanent environmental effect p ∼ N(0, Is 2 p ); r is the population random effect r ∼ N(0, Is 2 r ); b is the plot random effect b ∼ N(0, Is 2 b ); i corresponds to the random effect of the interaction between progenies and the years i ∼ N(0, I s 2 i ); and e is the experimental error assumed to be Independent and Identically Distributed (IID) outcomes from a normal density such that e ∼ N(0, Is 2 e ).The genetic parameters, heritability and correlation, were also estimated for the evaluated traits.The individual heritability was estimated by h 2 = s 2 g s 2 g +s 2 p +s 2 r +s 2 i +s 2 e .In addition, the adjusted phenotypes (y*, corrected BLUPs) for the year, plot, and year × progenies interaction effects were calculated and used for GS.The analyses were carried out using Selegen-REML/ BLUP software (de Resende, 2016).

Individual genomic prediction 2.3.1 GBLUP
The parameterization of the Genomic prediction G-BLUP model can be defined as follows where y * is the vector of adjusted phenotypic observations as previously detailed; b is the vector of means; X is the incidence matrix corresponding to the fixed effects; u is the vector of individual additive genomic effects such that u ∼ N(0, Gs 2 g ) where G is the kinship matrix describing genomic similarities between pairs of individuals, s 2 g is the additive genetic variance, Z is the incidence matrix that connect phenotypes with genotypes; e is the random error vector with e ∼ N(0, Is 2 e ) where s 2 e is the residual variance.The additive genomic kinship matrix G was obtained as described by VanRaden ( 2008) where, W is the centered (by columns) matrix of SNPs, which specifies the marker genotypes for each individual as 0, 1 or 2; p i is the frequency of the second allele at the locus, that is, The BGLR function of the BGLR package (Peŕez and de los Campos, 2014) in R software (R Core Team, 2022) was used to fitting GBLUP model.

Decision tree
The decision tree structure in this case is built using a regression tree algorithm.The objective is to create regions (R 1 , R 2 ,…, R M ) that minimize the difference between the predicted values and the adjusted observed values.This difference is measured by the Residual Sum of Squares (RSS).To achieve this, the algorithm performs a recursive binary splitting process.At each step, it considers all available features (X j − markers) and all possible split points (cutoff values) within each feature.The split that results in the lowest RSS for the resulting child nodes is chosen.This process continues recursively until a stopping criterion is met, such as reaching a minimum number of data points in a region.Mathematically, the two disjoint regions can be defined by (Hastie et al., 2009) and the goal is to minimize: where ŷ * R 1 is the average of the adjusted phenotypic values of the training observations belonging to the region R 1 (j, s) = XjX j < s È É , ŷ * R 2 is the average of the adjusted phenotypic values of the training observations belonging to the region R 2 (j, s) = XjX j ≥ s È É and y * i is the true value of each individual.

Random Forest
To construct a RF is necessary to create several datasets by resampling (bootstrapping) from the training set.After that, the bootstrap samples are used to build multiple trees considering a subset of predictors (markers) randomly selected (Hastie et al., 2009).Usually, for a continuous response, the number of predictors used to find the best split at each node is a subset that is chosen by m = v 3 , with v being the total number of predictors.Also, usually, the number of trees for the RF is set to 500.For the RF, the trees grow to their maximum size without pruning, and the prediction is done by averaging the trees.The function randomForest in randomForest R-package (Liaw and Wiener, 2002) was used to implement RF method.

Quantile Random Forest
For the construction of the QRF, as same as for RF, it is necessary to obtain T regression trees generated from bootstrap samples considering subsets of the markers under study (Hastie et al., 2009).Then, for the t th generated tree (T t ), the conditional distribution is obtained by weighting the observed values of the studied traits.Specifically, given an observation, X = x, it is defined for each terminal node (adjusted tree leaf), F(x, T tf ), the following weighting factor: w i (x, T tf ) = f g an indicator variable stating that the observed value (X = x) belongs to f th leaf and # m : X m ∈ F(x, T tf ) f g represents the number of observations on the f th leaf.
The prediction of a tree T t , according to Meinshausen (2006), for a new point, X = x new , is given by the weighted average of the observations Y i , that is, m (x new ) = o n i=1 w i (x, T tf )Y i .In this way, the prediction for a given observation, X = x, after the construction of T trees is given by m where I Y i ≤y f g is an indicator function, the predicted value for the t th quantile is given by Q t (x) = inf fy : F (yjX = x) ≥ t g, for any t, 0 < t < 1 .
The main difference between QRF and RF is that, for each node in each tree, the RF maintains only the average of the observations that fall into that node and discards any other information.
Conversely, the QRF maintains the value of all node observations (not just the average) and evaluates the conditional distribution based on this information (Meinshausen, 2006).This study evaluated nine Quantile Random Forest (QRF) for various quantile levels.The quantile parameter (t) ranged from 0.1 to 0.9 in increments of 0.1.Therefore, the models were named QRF0.1,QRF0.2,…,QRF0.9, reflecting the specific quantile they aimed to predict.The function quantregForest in quantregForest R-package (Meinshausen, 2017) was used to implement the QRF methods.

Multivariate adaptive regression splines
MARS (Friedman, 1991) forms reflexive pairs of base functions (BF) for each input (marker) X j , with nodes at each observed value x ij of that input.The model building strategy is like a progressive linear regression, but instead of using the original inputs, it implements base functions from the set C = (X j − t) + , (t ,…, p and/or its products.The MARS model, which is a linear combination of the BF and/or their interactions, is given by (Hastie et al., 2009): where b 0 is the regression constant, b m with m = 1, 2,…, M, are the regression coefficients, and h m (X) is a function in C, or a product of two or more functions.
The estimation process of the parameters b 0 and b m is based on the minimization of the residual sum of squares.First, the forward phase starts on the training data, building the model initially with only the constant function h 0 (X) = 1, and all functions in the C set are candidate functions.At each subsequent step, the base pair that produces the maximum reduction in training error is added.Considering a model with basic M functions, the next pair to be added to the model is where b M+1 and b M+2 are coefficients estimated by the least square method (Hastie et al., 2009), together with all other M + 1 coefficients in the model.This process of adding BF continues until the model reaches a predetermined maximum number, often leading to a purposefully overparametrized model (Zhang and Goh, 2016).The backward phase improves the model by removing the least significant terms until finding the best sub model.The model subsets are compared using the generalized cross-validation (GCV) method.The GCV is evaluated with the root-mean-square residual error divided by a penalty that depends on the complexity of the model (Zhang and Goh, 2016) and it is calculated as Ã 2 (Hastie et al., 2009) where M is the effective number of model parameters, C(M) is a cost function for each basis function included in the developed submodel, which by default adopts the value of 3, N is the number of datasets used in CV and f l (x i ) denotes the predicted MARS values.This study employed three Adaptive Regression Spline (MARS) models with varying degrees of interaction (1, 2, and 3).The MARS 1 model represents an additive model, meaning it captures only the linear effects of markers.In contrast, MARS 2 and MARS 3 allow for the inclusion of second and third-order interactions, respectively, enabling them to capture more complex non-additive relationships between markers.The function earth in earth Rpackage (Milborrow, 2017) was used to implement MARS models.

Stacking Ensemble Learning for genomic prediction
This study explores the SEL approach for improving the accuracy of genomic prediction models.SEL leverages predictions from multiple individual models (base learners, Level 0) and combines them using a meta-learner model (Level 1) to generate a final, potentially more accurate prediction.The base learners used in this study were GBLUP, different degrees of MARS (1, 2, 3), multiple QRFs (0.1 to 0.9), and RF.Their predictions, referred to as Genomic Estimated Breeding Values from Base Learners (GEBV-BL), formed the core metadata for the meta-learner.In practice, it is necessary to prepare a dataset with both the observable characteristics (phenotype) and the genetic information (SNP markers) of individuals.Then, diverse machine learning models (base learners) are trained on these data to make initial predictions.
These predictions from the base learners become the new features for a final model, the meta-learner.Finally, the meta-learner is trained using these base learner predictions as input and the original phenotype data as the target variable.In our work, four different combinations of metadata were explored: (i) GEBV-BL, only predictions from the base learners (standard approach); (ii) GEBV-BL+SNP, predictions combined with the original Single Nucleotide Polymorphism (SNP) markers (larger input dataset); (iii) GEBV-BL-Best; and (iv) GEBV-BL-Best + SNP, Similar to the previous cases, but only predictions from high-performing base learners (those exceeding the average predictive accuracy) were included.For GEBV-BL and GEBV-BL-Best datasets, six meta-learner methods were evaluated: Simple Mean (SSM); Weighted Regression (SWR); Regression (SR); Ridge Regression (SRR); Random Forest (SRF).For GEBV-BL + SNP and GEBV-BL-Best + SNP datasets, which included SNP markers, a two-kernel GBLUP model (S2KGBLUP) was additionally employed as the metalearner.The SEL scheme for genomic prediction is illustrated in the Figure 1.

Cross-validation
The PA of the models used as base-learners and the entire SEL process considered a CV scheme that was implemented as follows.First, the complete dataset under study was randomly divided into The stacking ensemble learning framework for genomic prediction from original data to the base learners, creating metadata for the meta-learner.Base-Learner (Level 0) is composed of the GBLUP, MARS (1°, 2°, and 3°), QRF considering nine quantiles (from 0.1 to 0.9, in steps of 0.1) and RF model.Four different meta-data were obtained: (i) GEBV-BL, only predictions from the base learners (standard approach); (ii) GEBV-BL+SNP, predictions combined with the original Single Nucleotide Polymorphism (SNP) markers (larger input dataset); (iii) GEBV-BL-Best; and (iv) GEBV-BL-Best + SNP, similar to the previous cases, but only predictions from high-performing base learners (those exceeding the average predictive accuracy).Meta-Learners: for GEBV-BL and GEBV-BL-Best metadata, six meta-learner methods were evaluated.Simple Mean (SSM), Weighted Regression (SWR), Regression (SR), Ridge Regression (SRR), Random Forest (SRF), and for GEBV-BL + SNP and GEBV-BL-Best + SNP datasets, which included SNP markers, a two-kernel GBLUP model (S2KGBLUP).
two sets (training and testing).The training set was composed by 70% of the individuals while the remaining 30% was assigned to the testing or validation set.The training set was used to calibrate the base-learners and the SEL for predicting the GEBVs of the individuals in the training set.This procedure was repeated 10 times.Then, for each approach, the average PA across replicated was computed.The PA was computed as the Pearson correlation between predicted GEBV and the adjusted phenotype values.The standard error (SE) was also computed.In addition, the mean square error (MSE) between the observed and predicted values was calculated.Finally, the agreement coefficient was used to compute the percentage of individuals with a performance above the 90 th percentile in fields given the top 10% of the GEBVs obtained with the different genomic prediction approaches.

Phenotypic data analysis
The across environments mean ( X) and standard deviation (SD) of the evaluated traits are summarized in Table 1.
The estimates of the heritability (proportion of phenotypic variability explained by the genetic component) for YL (0.30), NF (0.49), LM (0.30), and Cer (0.38) were moderate.The Spearman's correlation (lower triangle) between the adjusted phenotypic values of each pair of traits were positive and presented low to moderate values varying from 0.02 to 0.52.The higher and the lower correlation values were observed between YL and NF (0.52) and between NF and Cer (0.02, not statistically significant), respectively (Figure 2).The correlation between YL and LM, Cer and NF and LM, Cer were not statistically significant (Supplementary Figure S1).

Comparison between the base learners
Overall, none of the evaluated base learner methods outperformed the predictive performance of the others for all the evaluated traits.The estimated predictive abilities (PA) and corresponding standard deviations for the four traits (YL, NF, LM, and Cer) ranged from −0.01 (0.01) to 0.24 (0.01) and are presented in Figure 3. Specifically, for YL, NF, LM, and Cer, the highest PA values were 0.15 (0.01), 0.24 (0.01), 0.15 (0.01) and 0.24 (0.02), and these were obtained with MARS2 and QRF0.3,QRF0.7 and GBLUP methods, respectively (Figure 2).
The extreme QRF models QRF0.1, and QRF0.9 returned the highest MSE values across all the evaluated traits (Supplementary Table S1).In general, the MSE decreased as the fitted quantile model was approaching to the median model (QRF0.5).

Comparison between the Stacking Ensemble Learning approaches and GBLUP
The estimates of the PA obtained with the SEL models and the traditional genomic prediction method GBLUP model are shown in Figure 3.The results of the GBLUP model were used as benchmark since as it was mentioned it is the most convenient and used implementation in genomic prediction.
Regarding the different data sets used as input in the SEL approach, combining the predicted values obtained from base learners (GEBV-BL) with training data used to fitting the models, did not improve PA of these methods (Figure 3).Additionally, the results considering only the predicted values provided from those base learners with PA higher than mean of all base learner (GEBV-BL-Best) as input in the Level 1, returned the highest results (Figure 3).
For the four traits, the GBLUP model presented the lowest MSE values (Supplementary Table S2).As expected, since the Ridge Regression model (SRR) depends on a regularization parameter it presented a significant higher MSE (Supplementary Table S2).For this model, the MSE values were equal to 69.28 (9.18), 8390.68 (518.49),6.65 (0.80), and 3.45 (0.36) for YL, NF, LM, and Cer, respectively.The Spearman's correlation between the GEBVs obtained with the different prediction models, including the baseline GBLUP model and all the SEL models, presented positive values and these vary from low 0.25 to high 0.97 across the evaluated traits (Figures 4-7, lower triangular matrix).Low values of the Spearman's correlation were observed between the GBLUP and the other SEL methods and these were 0.25, 0.33, 0.28, and 0.30 for YL, NF, LM, and Cer, respectively (Figures 4-7, lower triangle).On the other hand, the highest correlation value (0.97) was observed between the GEBVs obtained by the Stacking Regression (SR) and the Stacking Ridge Regression (SRR) for LM (Figure 6).For each prediction method, the predicted values were ordered based on rankings then the percentage of common individuals in the top 10% between pairs of methods was computed.Overall, the GBLUP presented lower agreement with the SEL evaluated approaches (Figures 4-7, upper triangle).For instance, the agreement coefficient between the GBLUP and the SRB, SR, SR, and S2KGBLUP methods presented values varying from 0.31 to 0.48 for all evaluated traits (Figures 4-7, upper triangle).
Regarding the different data sets used as input in the SEL approach, the highest Spearman's correlations and agreements were observed between those methods that used the same kind of metadata as input in the fitting.Overall, considering these two measures, the methods were grouped into three groups (Supplementary Figures S6-S13).In general, the GBLUP was allocated into a single group.The only exception was for cercorporiosis (Cer) considering the agreement measure.Is this case, the GBLUP was allocated together with those methods that's considers only the predicted values provided from those base learners with PA higher than the mean of all base learner (GEBV-BL-Best) as input in the SEL approaches (Supplementary Figure S12).

Discussion
In this study, we used the SEL approach to improve PA of four important traits in Coffea Arabica.Two of these traits are associated with the productivity (YL and NF) and the remaining two with disease resistance (LM and Cer).The population under study is comprise of 195 genotypes of Coffea Arabica genotyped for 5,970 SNP markers.We compared the PA of different approaches used in the Level 1 of the SEL to the results obtained with the base learners [GBLUP, MARS considering degrees equal to 1, 2, and 3, QRF considering nine quantiles (from 0.1 to 0.9, every 0.1) and RF].Since the GBLUP is the most implemented prediction model (Zhang et al., 2021), their results were used as benchmark.The Predictive ability (PA) for yield (YL), total number of fruits (NF), leaf miner infestation (LM), and cercosporiosis incidence (Cer) measured in an Arabica coffee population composed of 195 individuals using a holdout validation scheme repeated 10 times.The fitted models used as base learners are: Genomic Best Linear Unbiased Predictor (GBLUP); Multivariate Adaptive Regression Splines with degrees equal to 1, 2, and 3 (MARS 1, MARS 2 and MARS 3); Quantile Random Forest evaluated at nine quantiles [(t): 0.1 to 0.9, every 0.1] -(QRF 0.1, …, QRF 0.9), and Random Forest (RF).
PA of the different approaches was assessed using a CV scheme repeated 10 times.The Spearman's correlation and the agreement (based on the top 10%) coefficients between the GEBV values of the different models were also estimated.The genetic parameters were also estimated for the evaluated traits (YL, NF, LM, and Cer).
The machine learning methods as base learners have been already used in genomic prediction (Long et al., 2011;Lenz et al., 2019;Montesinos-Loṕez et al., 2019;Coelho de Sousa et al., 2022;Costa et al., 2022).However, generally these methods do not outperform significantly the traditional genomic prediction approach based on parametric models such as GBLUP and Bayesian Alphabet (Liang et al., 2021).Liang et al. (2021) used the SEL for improving PA in three real datasets on average by 7.70%, compared to GBLUP.The SEL uses predicted values from different machine learning implementations to obtain a single prediction value.These authors integrated/ combined the results of three machine learning implementations (Support Vector Machine, Kernel Ridge Regression and Elastic Net) to compute the GEBVs.
In contrast to Liang et al. (2021), in our study, the GBLUP approach was used as one of the base learner methods for the SEL too.The GBLUP was considered since it is widely used for genome prediction (Zhang et al., 2021) due to its reduced computational demand and simplicity (Hernandez et al., 2020) compared to the other parametric methods (e.g., Bayesian Alphabet, Gianola et al.,  Nascimento et al. 10.3389/fpls.2024.1373318Frontiers in Plant Science frontiersin.org(2017).These authors found that the Quantile Regression approach outperform the traditional genomic prediction methods of not normal distributed traits.
An interesting approach to address the non-normality assumption is using multiple models to conduct the predictions, and then combine the predicted values to makeup a single prediction through the SEL approach.In general, the SEL outperforms the methods based on base learners only (Liang et al., 2021;Kandel et al, 2021;Kalule et al., 2023).In our study, the SEL approach outperformed all base learner methodologies (Figures 2, 3).However, it is important to emphasize that these results were observed by those SEL models that used only the predicted values provided from the base learners with PA higher than mean of all the base learners.Specifically, the Stacking Mean Best (SMB) presented the highest PA for all of the evaluated traits.The average of the predictions from several fitted models has been successfully implemented with Bagging and RF approaches (Breiman, 1996 andBreiman, 2001).The SEL approach allow to use several models to combine the predicted values, for example, XGBoost (Ghasemieh et al., 2023), Penalized methods (Kalule et al., 2023), Linear Regression (Liang et al., 2021).Similar to the single model approach, the performance of the different SEL implementation can vary from one data set to another with no one of these outperforming the others in all data sets.Thus, as it was recommended by these authors it is important to evaluate several models as meta-learners as well.
Regarding the MSE, as expected, the penalized models, showed larger values compared to the other evaluated methods.By design these methods induce bias aiming to reduce the variance of the estimations (Montgomery et al., 2021;Chan et al., 2022).However, these cannot guarantee the increasing of the PA compared to other methods.The SMB, which resulted to return the best results in terms of PA, also presented large values for the MSE.This can be a consequence that SMB-SEL uses predicted values derived from base learners that return large MSE values (Supplementary Table S1).
Overall, the SEL models presented moderate to high Spearman's correlation between them (Figures 4-6).On the other hand, these were low to moderate between SEL approaches and the GBLUP model.Additionally, among the 10% of genotypes with the highest GEBVs for YL, NF, LM, and Cer, the agreement coefficient between Spearman's correlation between the genomic estimated breeding values [GEBV] (lower diagonal matrix) and the concordance coefficient between the top 10% of the selected individuals (upper triangular matrix) considering all the different fitted models including the GBLUP model and the metalearners for total number of fruits (NF).The fitted models used as base learners are Stacking Simple Mean (SSM), Stacking Weighed Regression (SWR), Stacking Regression (SR), Stacking Ridge Regression (SRR), and the Stacking two-kernel GBLUP model (S2KGBLUP).The models named as best (SSMBest, SWRBest, SRBest, SRRBest, S2KGBLUP, and RFBest) used in the fitting only the results provided by those methods that presented predictive ability higher the mean in the Level 0. the SEL and GBLUP models showed values varying from moderate to high, suggesting differences in the obtained classifications with these.In general, the cluster analysis of these results showed that the methods can be grouped into three distinct groups (Supplementary Figures S6-S13) with the GBLUP forming a group by itself.Altogether, these results show that the use of SEL to predict the individual genetic merit of four important traits in Arabica Coffee is worth to investigate.The SEL approach showed higher estimates of PA compared with all evaluated base learning methods, in special to the traditional GBLUP method.In practice, SEL's ability to combine methods with diverse characteristics facilitates a more comprehensive exploration of the relationships between variables leading to more accurate selection of breeding.This approach considers a wider range of factors and reduces the reliance on any single model's limitations.However, evaluating phenotypes across multiple environments can pose challenges for SEL.Unlike GBLUP, which presents higher computational efficiency (Hernandez et al., 2020), many base learners in SEL methods are based on machine learning requiring significant computation time in certain scenarios.Studies have explored the use of single machine learning methods for multienvironment trials (METs).For example, Barreto et al. (2024) applied machine learning to predict hybrid performance in METs and achieved similar PA compared to GBLUP with non-additive effects.As highlighted by Montesinos Loṕez et al. (2022) in their study using RF for METs, training any machine learning model can be computationally demanding, especially for datasets where the training data sets are very large.The hyperparameter tuning for individual base learners within a SEL framework is a well-established approach to enhance model performance.However, it is important to acknowledge that SEL ensembles can achieve strong results even with default base learner parameters (Friedel et al., 2023).This aligns perfectly with the core principle of ensemble learning, that is, leveraging predictions from multiple models can outperform any single model.In our study, the high dimensionality of the data presented significant computational challenges for hyperparameter tuning.Additionally, the observed superiority of the SEL approach compared to traditional methods suggested that tuning might not be as critical for achieving good results.

Conclusion
The SEL method was able to predict the PA of important traits (YL, NF, leaf miner infesting, cercosporiosis resistance) in Coffea Arabica.In addition, SEL presented higher PA compared with those obtained for all base learner methods (GBLUP, MARS considering degrees equal to 1, 2, and 3, QRF considering nine quantiles, from 0.1 to 0.9, every 0.1 and RF).

FIGURE 3
FIGURE 3Predictive ability (PA) for yield (YL), total number of fruits (NF), leaf miner infestation (LM), and cercosporiosis incidence (Cer) measured in an Arabica coffee population composed of 195 individuals using a holdout validation scheme repeated 10 times.The fitted models used as base learners are: Stacking Simple Mean (SSM), Stacking Weighed Regression (SWR), Stacking Regression (SR), Stacking Ridge Regression (SRR), and the Stacking twokernel GBLUP model (S2KGBLUP).The models named as best (SSMBest, SWRBest, SRBest, SRRBest, S2KGBLUP, and RFBest) used in the fitting only the results provided by those methods that presented predictive ability higher the mean in the Level 0.

FIGURE 6
FIGURE 6Spearman's correlation between the genomic estimated breeding values [GEBV] (lower diagonal matrix) and the concordance coefficient between the top 10% of the selected individual's upper triangular matrix) considering all the different fitted models including the GBLUP model and the metalearners for leaf minor infestation (LM).The fitted models used as base learners are: Stacking Simple Mean (SSM), Stacking Weighed Regression (SWR), Stacking Regression (SR), Stacking Ridge Regression (SRR), and the Stacking two-kernel GBLUP model (S2KGBLUP).The models named as best (SSMBest, SWRBest, SRBest, SRRBest, S2KGBLUP, and RFBest) used in the fitting only the results provided by those methods that presented predictive ability higher the mean in the Level 0.